{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Molecule featurizers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from chemprop.featurizers.molecule import (\n",
    "    MorganBinaryFeaturizer,\n",
    "    MorganCountFeaturizer,\n",
    "    RDKit2DFeaturizer,\n",
    "    V1RDKit2DFeaturizer,\n",
    "    V1RDKit2DNormalizedFeaturizer,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These are example molecules to featurize."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from chemprop.utils import make_mol\n",
    "\n",
    "smis = [\"C\" * i for i in range(1, 11)]\n",
    "mols = [make_mol(smi, keep_h=False, add_h=False, ignore_stereo=False) for smi in smis]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Molecule vs molgraph featurizers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both molecule and [molgraph](./molgraph_molecule_featurizer.ipynb) featurizers take `rdkit.Chem.Mol` objects as input. Molgraph featurizers produce a `MolGraph`, which is used in message passing. Molecule featurizers produce a 1D NumPy array of features that can be used as [extra datapoint descriptors](../data/datapoints.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from chemprop.data import MoleculeDatapoint\n",
    "\n",
    "molecule_featurizer = MorganBinaryFeaturizer()\n",
    "\n",
    "datapoints = [MoleculeDatapoint(mol, x_d=molecule_featurizer(mol)) for mol in mols]\n",
    "\n",
    "molecule_featurizer(mols[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Morgan fingerprint featurizers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Morgan fingerprints can use either a binary or a count representation of molecular substructures. The substructure radius, the fingerprint length, and whether to include chirality can all be customized. By default, the radius is 2, the length is 2048, and chirality is included."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((1024,), array([0, 0, 0, ..., 0, 0, 0], dtype=int32))"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mf = MorganCountFeaturizer(radius=3, length=1024, include_chirality=False)\n",
    "morgan_fp = mf(mols[0])\n",
    "morgan_fp.shape, morgan_fp"
   ]
  },
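  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration of the difference between the two representations, the sketch below featurizes the same molecule with both featurizers at their default settings. The variable names `binary_fp` and `count_fp` are illustrative. One would expect the count fingerprint to sum to at least as much as the binary one, since a substructure that occurs several times is counted once per occurrence rather than once overall."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mol = mols[4]  # pentane, \"CCCCC\"\n",
    "\n",
    "# Same molecule, same default radius and length, two representations\n",
    "binary_fp = MorganBinaryFeaturizer()(mol)\n",
    "count_fp = MorganCountFeaturizer()(mol)\n",
    "\n",
    "binary_fp.sum(), count_fp.sum()"
   ]
  },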
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### RDKit molecule featurizers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Chemprop warns that the RDKit molecule featurizers are not well scaled by a `StandardScaler`, because the features can deviate significantly from a normal distribution. Consult the literature for more appropriate scaling methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The RDKit 2D features can deviate signifcantly from a normal distribution. Consider manually scaling them using an appropriate scaler before creating datapoints, rather than using the scikit-learn `StandardScaler` (the default in Chemprop).\n",
      "[16:42:01] DEPRECATION WARNING: please use MorganGenerator\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.35978494,\n",
       "        0.        , 16.043     , 12.011     , 16.03130013,  8.        ,\n",
       "        0.        , -0.07755789, -0.07755789,  0.07755789,  0.07755789,\n",
       "        1.        ,  1.        ,  1.        , 12.011     , 12.011     ,\n",
       "       -0.07755789, -0.07755789,  0.1441    ,  0.1441    ,  2.503     ,\n",
       "        2.503     ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  8.73925103,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  7.42665278,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  7.42665278,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  7.42665278,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  7.42665278,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        1.        ,  1.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.6361    ,  6.731     ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "molecule_featurizer = RDKit2DFeaturizer()\n",
    "extra_datapoint_descriptors = [molecule_featurizer(mol) for mol in mols]\n",
    "extra_datapoint_descriptors[0]"
   ]
  },
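  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As one hedged illustration of scaling the features manually before creating datapoints, the sketch below applies scikit-learn's `QuantileTransformer` to the descriptors computed above. The choice of scaler and the names `X_d`, `X_d_scaled`, and `scaled_datapoints` are assumptions made for illustration, not a Chemprop recommendation. In a real workflow the scaler would be fit on the training split only and then applied to the validation and test splits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.preprocessing import QuantileTransformer\n",
    "\n",
    "# Stack the per-molecule descriptor vectors into an (n_molecules, n_features) matrix\n",
    "X_d = np.vstack(extra_datapoint_descriptors)\n",
    "\n",
    "# QuantileTransformer is an illustrative choice, not a Chemprop recommendation\n",
    "scaler = QuantileTransformer(n_quantiles=len(X_d))\n",
    "X_d_scaled = scaler.fit_transform(X_d)\n",
    "\n",
    "# Pass the pre-scaled rows as extra datapoint descriptors\n",
    "scaled_datapoints = [MoleculeDatapoint(mol, x_d=x) for mol, x in zip(mols, X_d_scaled)]"
   ]
  },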
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The RDKit featurizers from Chemprop v1 are also available. They rely on the `descriptastorus` package, which can be found at [https://github.com/bp-kelley/descriptastorus](https://github.com/bp-kelley/descriptastorus). This package doesn't include the following RDKit descriptors: `['AvgIpc', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'SPS']`. Scaled (normalized) versions of the `descriptastorus` descriptors are available, but it is unknown which molecules were used to fit the scaling, so using them may cause a data leak, depending on the test set used to evaluate model performance. See this [issue](https://github.com/bp-kelley/descriptastorus/issues/31) for more details about the scaling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[16:42:01] DEPRECATION WARNING: please use MorganGenerator\n",
      "[16:42:01] DEPRECATION WARNING: please use MorganGenerator\n",
      "[16:42:01] DEPRECATION WARNING: please use MorganGenerator\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([1.96075662e-05, 5.77173432e-04, 3.87525506e-15, 2.72296612e-11,\n",
       "       1.02515408e-07, 4.10254814e-13, 1.63521389e-11, 1.93930344e-05,\n",
       "       1.22824218e-06, 2.20907757e-07, 6.35349909e-07, 3.08677419e-06,\n",
       "       1.70338959e-05, 1.34072882e-05, 4.07488775e-10, 2.17523456e-08,\n",
       "       6.89356874e-07, 2.63048207e-01, 1.96742684e-02, 2.50993926e-11,\n",
       "       9.25841695e-11, 5.85610910e-17, 1.08871430e-06, 2.39145041e-11,\n",
       "       7.52245592e-13, 1.23345732e-08, 2.94906350e-01, 9.59992784e-03,\n",
       "       2.31947354e-03, 9.99390325e-01, 9.88006922e-01, 1.59186446e-08,\n",
       "       4.42180049e-09, 1.00000000e+00, 7.85198619e-13, 4.14332758e-13,\n",
       "       6.49617582e-11, 4.45588945e-06, 7.89307465e-03, 2.39990382e-02,\n",
       "       7.89307465e-03, 4.59284380e-03, 3.24286613e-10, 1.83192891e-02,\n",
       "       7.38491174e-01, 9.73505944e-01, 6.05575320e-02, 3.42737552e-07,\n",
       "       1.23284669e-08, 6.13163344e-02, 3.33304127e-02, 9.93858689e-22,\n",
       "       1.42492255e-01, 6.29631332e-02, 3.47228888e-02, 4.82992991e-15,\n",
       "       1.11775996e-02, 1.89758400e-02, 5.52866693e-02, 5.22997303e-05,\n",
       "       5.69516350e-08, 2.15229839e-03, 0.00000000e+00, 1.14242658e-21,\n",
       "       2.40245513e-23, 1.31105703e-02, 8.72153349e-03, 5.76142917e-21,\n",
       "       3.60875252e-15, 1.45980119e-01, 1.73556718e-22, 1.18093757e-10,\n",
       "       5.99833786e-02, 9.05498589e-08, 4.60978367e-10, 1.57072376e-01,\n",
       "       1.66847964e-01, 2.37240682e-02, 8.07601514e-02, 2.75008841e-02,\n",
       "       4.92845505e-03, 1.24459630e-01, 7.31816496e-02, 1.67096874e-01,\n",
       "       7.55810089e-02, 8.78622233e-24, 1.33643046e-01, 3.04494668e-02,\n",
       "       2.58369311e-02, 5.30138094e-05, 1.42657565e-16, 3.73160396e-02,\n",
       "       6.95272017e-13, 0.00000000e+00, 9.79690873e-13, 2.64281353e-04,\n",
       "       1.20493060e-11, 2.86305006e-09, 1.04578852e-01, 3.09944928e-02,\n",
       "       2.99487758e-06, 2.77639012e-01, 5.30138094e-05, 6.17138309e-03,\n",
       "       5.30138094e-05, 5.00000000e-01, 3.84710451e-01, 5.30138094e-05,\n",
       "       5.30138094e-05, 1.64664515e-01, 5.30138094e-05, 9.98653446e-01,\n",
       "       3.99820633e-01, 2.02868342e-02, 5.70867846e-19, 3.32362804e-10,\n",
       "       9.64197643e-10, 7.10542736e-15, 5.83707586e-13, 1.19880642e-20,\n",
       "       1.65079548e-01, 1.67040631e-01, 1.66498334e-01, 1.66486816e-01,\n",
       "       2.02864661e-01, 6.93658809e-02, 7.10542736e-15, 1.68346480e-01,\n",
       "       1.67982932e-01, 6.87189958e-10, 1.18157291e-03, 1.64332634e-01,\n",
       "       8.37776917e-04, 1.66325734e-01, 1.63034142e-01, 1.65079548e-01,\n",
       "       9.56970492e-08, 3.49708922e-08, 1.68206175e-01, 1.65806858e-01,\n",
       "       1.67346595e-01, 7.13964619e-07, 2.64115098e-12, 9.99127911e-02,\n",
       "       2.86809243e-10, 3.77737848e-01, 4.50616778e-03, 1.33250251e-01,\n",
       "       3.47299284e-02, 1.61482916e-09, 1.87517315e-18, 2.09410539e-07,\n",
       "       7.10542736e-15, 4.99264281e-01, 1.64929402e-01, 1.31744508e-17,\n",
       "       2.11164355e-16, 1.16815875e-09, 3.25923600e-22, 6.24601420e-10,\n",
       "       1.68149182e-01, 1.65450729e-01, 1.17110262e-13, 0.00000000e+00,\n",
       "       1.64668868e-01, 1.66924728e-01, 0.00000000e+00, 5.10071327e-08,\n",
       "       7.10542736e-15, 1.54654108e-01, 2.79420938e-22, 0.00000000e+00,\n",
       "       1.67639733e-01, 6.31499266e-25, 1.68186130e-01, 9.08850267e-03,\n",
       "       1.68363202e-01, 8.26542313e-11, 1.56346354e-01, 0.00000000e+00,\n",
       "       0.00000000e+00, 2.11354236e-02, 2.11354236e-02, 2.38815575e-20,\n",
       "       0.00000000e+00, 8.33672450e-25, 5.30138094e-05, 1.56951066e-01,\n",
       "       4.03434503e-08, 1.55259196e-23, 1.59306117e-17, 5.76610077e-14,\n",
       "       2.95798941e-11, 1.68378369e-01, 1.67380186e-01, 1.48151465e-18,\n",
       "       2.32414994e-16, 4.70359809e-08, 1.66633397e-01, 1.87492844e-01])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
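   ],
   "source": [
    "# Unnormalized v1 descriptors; shown for reference and overwritten on the next line\n",
    "molecule_featurizer = V1RDKit2DFeaturizer()\n",
    "# Normalized v1 descriptors; this is the featurizer called below\n",
    "molecule_featurizer = V1RDKit2DNormalizedFeaturizer()\n",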
    "molecule_featurizer(mols[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Custom molecule featurizers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Any object that defines `__len__` and returns a 1D NumPy array when called with an `rdkit.Chem.Mol` can be used as a molecule featurizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from rdkit import Chem\n",
    "\n",
    "class MyMoleculeFeaturizer:\n",
    "    def __len__(self) -> int:\n",
    "        return 1\n",
    "\n",
    "    def __call__(self, mol: Chem.Mol) -> np.ndarray:\n",
    "        total_atoms = mol.GetNumAtoms()\n",
    "        return np.array([total_atoms])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mf = MyMoleculeFeaturizer()\n",
    "mf(mols[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using molecule features as extra datapoint descriptors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you only have molecule features for one molecule per datapoint, those features can be used directly as extra datapoint descriptors. If you have multiple molecules with extra features, or other extra datapoint descriptors, they first need to be concatenated into a single numpy array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "mol1_features = np.random.randn(len(mols), 1)\n",
    "mol2_features = np.random.randn(len(mols), 2)\n",
    "other_datapoint_descriptors = np.random.randn(len(mols), 3)\n",
    "\n",
    "extra_datapoint_descriptors = np.hstack([mol1_features, mol2_features, other_datapoint_descriptors])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
